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Comment: Performance of Double-Robust 
Estimators When "Inverse Probability" 
Weights Are Highly Variable 

James Robins, Mariela Sued, Quanhong Lei-Gomez and Andrea Rotnitzky 



1. GENERAL CONSIDERATIONS 

We thank the editor Ed George for the opportu- 
nity to discuss the paper by Kang and Schaeffer. 

The authors' paper provides a review of double- 
robust (equivalently, double-protected) estimators of 
(i) the mean /i = E{Y) of a response Y when Y 
is missing at random (MAR) (but not completely 
at random) and of (ii) the average treatment effect 
in an observational study under the assumption of 
strong ignorability. In our discussion we will depart 
from the notation in Kang and Schaeffer (through- 
out, K&S) and use capital letters to denote random 
variables and lowercase letter to denote their possi- 
ble values. 

In the missing-data setting (i), one observes n 
i.i.d. copies of O = {T,X,TY), where X is a vector 
of always observed covariates and T is the indicator 
that the response Y is observed. An estimator of fi is 
double-robust (throughout, DR) if it remains consis- 
tent and asymptotically normal (throughout, CAN) 
when either (but not necessarily both) a model for 
the propensity score 7r(X) = P{T = l\X) = P{T = 
1|X, Y) or a model for the conditional mean m[X) = 
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E(Y\X) = E{Y\X, T = 1) is correctly specified, where 
the equalities follow from the MAR assumption. The 
authors demonstrate, via simulation, that when a 
linear logistic model for the propensity score and a 
linear model for the mean of Y given X are both 
moderately misspecified, there exists a joint distri- 
bution under which the OLS regression estimator 
fiOLS of n outperforms all candidate estimators that 
depend on a linear logistic maximum likelihood es- 
timate of the propensity score, including all the DR 
estimators considered by the authors. 

Near the end of their Section 1, the authors state 
that their simulation example "appears to be pre- 
cisely the type of situation for which the DR esti- 
mators of Robins et al. were developed." They then 
suggest that their simulation results imply that the 
cited quotation from Bang and Robins (2005) is 
incorrect or, at the very least, misguided. We dis- 
agree with both the authors' statement and sugges- 
tion. First, the cited quote neither claims nor implies 
that when a linear logistic model for the propen- 
sity score and a linear model for the mean of Y 
given X are moderately misspecified, DR estimators 
always outperform estimators — such as regression, 
maximum likelihood, or parametric (multiple) im- 
putation estimators — that do not depend on the es- 
timated propensity score. Indeed, Robins and Wang 
(2000) in their paper "Inference for Imputation Es- 
timators" stated the following: 

If nonresponse is ignorable, a locally semi- 
parametric efficient estimator is doubly pro- 
tected; i.e., it is consistent if either a model 
for nonresponse or a parametric model for 
the complete data can be correctly spec- 
ified. On the other hand, consistency of 
a parametric multiple imputation estima- 
tor requires correct specification of a para- 
metric model for the complete data. How- 
ever, in cases in which the variance of the 
'inverse probability' weights is very large. 
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the sampling distribution of a locally semi- 
parametric efficient (augmented inverse prob- 
ability of response weighted) estimator can 
be markedly skew and highly variable, and 
a parametric imputation estimator may 
be preferred. 

The just-quoted cautionary message of Robins and 
Wang (2000) is not far from K&S's take-home mes- 
sage. In Section 5 we show that, in the authors' sim- 
ulation example, the variance of the estimated "in- 
verse probability" weights is very large and the sam- 
pling distribution of their candidate DR estimators 
is skewed and highly variable. It follows that their 
example is far from the settings Bang and Robins 
had in mind when recommending the "routine use of 
DR estimators." Rather, their example falls squarely 
into the class for which Robins and Wang (2000) 
cautioned that a parametric imputation estimator 
may be preferable to DR estimators. 

Even prior to Robins and Wang (2000), Robins, 
Rotnitzky and colleagues had published extensive 
warnings about, and simulation studies of, the haz- 
ards of highly variable "inverse probability" weights 
(Robins, Rotnitzky and Zhao, 1995, pages 113-115; 
Scharfstein, Rotnitzky and Robins, 1999, pages 1108- 
1113), although not specifically for DR estimators. 
Due to the fact that the problem of highly variable 
weights was not the focus of their paper and had 
already been discussed extensively in earlier papers 
by Robins and colleagues. Bang and Robins (2005) 
did not repeat Robins and Wang's (2000) caution- 
ary message. In retrospect, had they done so or had 
the authors been aware of the Robins and Wang ar- 
ticle, a misunderstanding could perhaps have been 
averted. 

Whenever the "inverse probability" weights are 
highly variable, as in K&S's simulation experiment, 
a small subset of the sample will have extremely 
large weights relative to the remainder of the sam- 
ple. In this setting, no estimator of the marginal 
mean // = E{Y) can be guaranteed to perform well. 
That is why, in such settings, some "argue that in- 
ference about the mean E{Y) in the full population 
should not be attempted," to quote from the au- 
thors' discussion. Yet, surprisingly, in the authors' 
simulation experiment, the regression estimator fioLS 
performed very well with a mean squared error (MSE) 
less than any of their candidate DR estimators, all 
of which estimated the propensity score by maxi- 
mum likelihood under a linear logistic model. The 



explanation is that, whether due to unusual luck or 
to "cherry-picking," the chosen data-generating dis- 
tribution was as if optimized to have /j-ols perform 
well. Indeed, in Section 5, we "deconstruct" the cho- 
sen distribution and show that it possesses a num- 
ber of specific, some rather unusual, features that 
together served to insure jloLS would perform well 
even under K&S's misspecified models. 

Now, even were the chosen joint distribution of 
(y, T, X) optimized to have /j-ols perform extremely 
well as an estimator of E(Y) on data (TY, T, X) in 
which Y is observed only when T = 1, such opti- 
mization would not guarantee that fioLS would also 
perform well on the data ((1 — T)Y, T, X) in which Y 
is observed only when T = 0. Based on this insight, 
in Section 5, we repeat K&S's simulation study, ex- 
cept based on data ((1 — T)Y, T, X) rather than data 
{TY, T, X), and show that, indeed, fioLS is now out- 
performed by all candidate DR estimators in terms 
of both bias and MSE. 

In the analysis of real, as opposed to simulated 
data, we do not know a priori whether the features of 
the joint distribution of {Y,T,X) do or do not favor 
i^OLS- Furthermore, with highly variable "inverse 
probability" weights, we generally cannot learn the 
answer from the data, owing to poor power. This 
suggests that, with highly variable weights, a single 
estimator, whether fioLS or a single DR estimator 
of the mean /x is never adequate even with MAR 
data; rather an analyst should either "not attempt 
to make inference about the mean" or else provide 
a sensitivity analysis (in which models for both the 
propensity score and the regression of y on X and 
estimators of /i are varied). In Section 6, we sketch 
a possible approach to sensitivity analysis. 

In this discussion we ask the following question: 
can we find DR estimators that, under the authors' 
chosen joint distribution for {Y, T, X), both perform 
almost as well as fj-oLS applied to data {TY,T,X) 
and yet perform better than fioLS when applied to 
data ((1 — T)Y,T,X). In Section 4 we describe the 
principles we used to search among the set of pos- 
sible DR estimators and discuss the expected per- 
formance of various candidates. We define a gen- 
eral class of DR estimators, which we refer to as 
"bounded," that contains the DR estimators that 
perform best in the setting of highly variable "in- 
verse probability" weights. We further subdivide the 
class of bounded DR estimators into two subclasses — 
bounded Horvitz-Thompson DR estimators and re- 
gression DR estimators. We then describe various 
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scenarios which favor one subclass over the other. 
We also explain why certain DR estimators perform 
particularly poorly in settings with highly variable 
"inverse probability" weights. The performance of 
our estimators is examined in the simulations re- 
ported in Section 5, which both mimics the simula- 
tions in S&K and also repeats it but now using data 
{{l-T)Y,T,X). 

A major point emphasized by K&S was that, in 
their simulations, the regression estimator fioLS out- 
performed any DR estimator when both their model 
for the propensity score and for the regression of 
y on X (from now on referred to as the "outcome 
model") were misspecified. However, they restricted 
attention to linear logistic propensity score models. 
In Section 3, we show that fioLS is CAN for fi when 
either (but not necessarily both) a linear model for 
the inverse propensity score, l/-7r(X) = X a, or a 
linear model X P for the conditional mean E[Y\X) 
is correctly specified. That is, by definition, fioLS it- 
self is a DR estimator of fi when the inverse linear 
model tt{X) = l/{X'^a) for the propensity score is 
substituted for K&S's linear logistic model! 

In K&S's simulation experiment, the linear model 
X P and the model 7r(X) = 1/{X a) are both mis- 
specified. Yet, under their scenario, fioLS did not 
"outperform any DR estimator that is CAN when ei- 
ther the regression model X P or the model 7r(x) = 
l/{x'^a) is correctly specified," precisely because 
fi-OLS is one such DR estimator! Of course, the model 
7r(X) = l/{X^a) would rarely, if ever, be used in 
practice as the model does not naturally constrain 
tt{X) to lie in [0, 1]. Nonetheless, understanding that 
P'OLS is a DR estimator provides important insight 
into the meaning and theory of double-robustness. 

2. GENERAL FORM OF DOUBLE-ROBUST 
ESTIMATORS 

The authors note that many different DR esti- 
mators exist and give a number of explicit exam- 
ples. The authors restrict attention to MAR data 
with missing response. In this setting Robins (2000), 
Robins (2002), Tan (2006) and van der Laan and 
Robins (2003) had previously proposed a rather wide 
variety of DR estimators, in addition to the DR 
estimators of Robins and colleagues considered by 
the authors. Moreover, Scharfstein et al. (1999) and 
van der Laan and Robins (2003) provided general 
methods for the construction of DR estimators in 
models with MAR or coarsened at random (CAR) 



data. Robins and Rotnitzky (2001) described a gen- 
eral approach to the construction of DR estimators 
(when they exist) in a very large model class that 
includes all MAR and CAR models as well as certain 
nonignorable (i.e., non-CAR) missing data models. 
Recently, van der Laan and Rubin (2006) have de- 
veloped a general approach called "targeted maxi- 
mum likelihood" that has overlap with methods in 
Scharfstein et al. (1999), Robins (2000, 2002) and 
Bang and Robins (2005) in the setting of missing 
response data. We will use the general methods of 
Robins and Rotnitzky (2001) to find a candidate set 
of DR estimators among which we then search for 
ones that perform in simulations as well as or better 
than those discussed by S&K. 

Most of the DR estimators of fj, discussed by K&S 
are of the general form 



fiDRirr,m) =Fn{m{X)} +Fn 



^X) 



{Y-miX)} 



or 



/tB-D/?,(7i", rh) = Fn{m{X)} 



+ 



¥n[T/7TiX){Y-rhiX)}] 
P„{r/(7r(X))} 



where throughout, Pn(^) is a shortcut for n~^ ■ 
J27=iAi. Robins, Sued, Lei-Gomez and Rotnitzky 
(2007) show that these estimators are solutions to 
particular augmented inverse probability weighted 
(AIPW) estimating equations. The AIPW estimat- 
ing equations are obtained by applying the general 
methods of Robins and Rotnitzky (2001) to the sim- 
ple missing-data model considered by K&S. 

Quite generally, to construct m and vr we specify 
(i) a "working" parametric submodel for the propen- 
sity score vr(X) = Pr(T = 1|^) of the form 



(2) 



7r(-)G{vr(-; a) :aGM^}, 



where 7r(rE; a) is a known function, for example, 
n{x;a) = {1 + exp(— x-^a)}"^ as in S&K and, (ii) 
a working parametric model for m{X) = E(Y\X) of 
the form 



(3) 



m{-)e{m{-,p):P£W}, 



where m{x; P) is a known function, for example, 
m{x]P) = X P as in S&K. We then obtain estima- 
tors a and P which converge at rate n^'"^ to some 
constant vectors a* and /?*, which are, respectively, 
equal to the true value of a and/or P when the cor- 
responding working model is correctly specified, and 
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define m{x) = m{x]l3) and 7r(x) = 7r(x;a). [In fact, 
under mild additional regularity conditions, the rate 
v}''^ can be relaxed to vS , C,> 1/4, a fact which is 
critical when the dimensions p and q oi (3 and a are 
allowed to increase with n as n^, p < 1/2.] 

Under regularity conditions, if either (but not nec- 
essarily both) (2) or (3) is correct, /xD;j(7r,m) and 
iJ'B-DR{T^-,rn) are consistent and asymptotically nor- 
mal (CAN) estimators of /i. 

In the special case in which (a) 7r(x;a) = {1 + 
exp(— a;-^a)}~^ and m{x;P) = ^{x'^/3) where $ is a 
known canonical inverse link function, and (b) a is 
the MLE of a and (3 is the iteratively reweighted 
least squares estimator of /3 among respondents 
(throughout denoted as $reg) satisfying 

(4) ¥n[TX{Y - ^X'^^reg)}] = 0, 

we shall denote rh with rhREG ^^nd the resulting DR 
estimators as jiDR{T^,rhB.EG) and fiB-DR{Ti',rhREG)- 
When $ is the identity, (Sreg is thus the OLS es- 
timator of p. In such case, j1dr{t^-, itt-reg) is the esti- 
mator denoted fiBC-OLS in S&K and fiB-DR{^, rriREG) 
is the estimator in the display following (8) in S&K. 

3. AoLS AS A DR ESTIMATOR UNDER A 
LINEAR INVERSE PROPENSITY MODEL 

As anticipated in Section 1, in this section we 
will argue that the regression estimator ftoLS is in- 
deed a DR estimator with respect to specific work- 
ing models. Suppose that we postulate the linear 
inverse propensity model, that is, in (2) we take 
TT{x;a) = l/la^x). It follows from (4) that for any 

(5) ^n \ Z A Y - mREG{X)}\ = 0. 

Thus, the regression estimator jloLS is indeed equal 
to jlDRi'^i-',ct),rtiREG) for any a and therefore DR 
with respect to the linear inverse propensity model 
"Psub.inv and the outcome model ^{x'^P) = x'^ 13. 

To estimate a in model Psub,inv we may use either 
the estimator a.\nv or the estimator Oinv that, respec- 
tively, minimize the log-likelihood Pn[nog{7r(X; 
a)} + {1 — T}log{l — 7r(X;a)}] or squared norm 
||Fra[{ ^.j^.^s — 1}X] IP both subject to the constraints 
7r(Xj;a) > 0, i = 1, . . . , A^. Under regularity condi- 
tions, Oinv and Oinv converge in probability to quan- 
tities a*jjy and al*^ with the property that when 
model "Psub.inv is correctly specified, 7r(X;afj^y) and 
7r(A; a^*^) are equal to the propensity score P{T = 
1|A). 



4. DOUBLE-ROBUST ESTIMATORS WITH 
DESIRABLE PROPERTIES 

4.1 Boundedness 

We would like to have DR estimators of /i = E{Y) 
with the "boundedness" property that, when the 
sample space of Y is finite, they fall in the parameter 
space for /i with probability 1. Neither j1dr{t^, itT'REg) 
nor fiB-DRiT^^inREG) has this property. We consider 
two separate ways to guarantee the "boundedness" 
property. 

First, suppose that we found DR estimators that 
could be written in the IPW form 



(6) 



P„{yT/^(A)}/P„{T/^(A)} 



for some nonnegative 7r(-). Then the property would 
hold for such estimators. Specifically, the quantity 
in the last display is a convex combination of the 
observed Y-values and thus always lies in the in- 
terval [l^in, ^ax] with endpoints the minimum and 
maximum observed F-values. But, [ymirn^max] is in- 
cluded in the parameter space for /x because p is the 
population mean of Y . 

Note that division by P„{r/7r(A)} is essential to 
ensure that (6) is in [Ymirn^max]- In particular, the 
Horvitz-Thompson estimator JIht = '^n{YT/TT{X)} 
does not satisfy this property. For example, if Y is 
Bernoulli, (6) lies in [0, 1] but JIht niay lie outside 
[0, 1] . For instance, this will be the case if in the sam- 
ple there exists a unit with T = 1, 7r(A) < 1/n and 
Y = 1 since then JIht will be greater than 1. Indeed, 
when we carried out 1000 Monte Carlo replications 
of Kang and Schaeffer's simulation experiment for 
sample size n = 1000 using their misspecified anal- 
ysis model for 7r(x), we found in one particularly 
anomalous replication, a simulated unit with T =\ 
but with the unusually small estimated propensity 
tt{X) < 1/17,000. Thus, had we simulated Y from 
a Bernoulli rather than from a normal distribution, 
we would have had /Iht > 17 for this anomalous 
replication! 

The desire that an estimator falls in the interval 
[^mim^max] is in Conflict with the desire that it be 
unbiased, as we now show. Suppose the propensity 
score function 7r(x) were known. The set of exactly 
unbiased estimators of p (that are invariant to per- 
mutations of the index i labeling the units) are con- 
tained in the set {Pn[T{Y - q{X)}/7r{X) + q{X)]} 
as q{X) varies. It follows that no unbiased estima- 
tor of fj, exists for Y Bernoulli that is guaranteed 
to fall in [ymim^max]- Taking q{X) identically zero, 
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we obtain JIht =^n{YT/TT{X)}. Suppose that in 
an actual study of 1000 subjects, a rare fluctuation 
had occurred and there was a subject with T = 1 
whose propensity score tt{X) was less than 1/17,000 
so JIht > 17. We doubt any scientist could be con- 
vinced to publish the logically impossible estimate 
JIht for the mean of a Bernoulli Y, with the ar- 
gument that only then would his estimator of the 
mean be exactly unbiased over hypothetical repeti- 
tions of the study that, of course, neither have oc- 
curred nor will occur. Exactly analogous difficulties 
arise for any other choice of q{X). With highly vari- 
able weights, "boundedness" trumps unbiasedness. 
Second, the boundedness property also holds for 
DR estimators that can be written in the regression 
form 



(7) 



\{MX)} 



with m(x) = ^{x'^(3 + h[x)'^j) for some specified 
function /i(-) and an inverse link function <1> satisfy- 
ing inf Y < <l>(u) < sup Y for all u, where Y is the 
sample space of y. This follows because (i) P„{m(X)} 
falls in the interval [m^i^,m^i^x], with rh^i^ and 
^max the minimum and maximum values of m{X) 
among the n sample units and, (ii) the above choice 
of <^(-) guarantees that [mmirn^max] is contained in 
the parameter space for /i. 

Neither the estimator jJ-oRi'T^^'nT-REG) nor the 
estimator fJiB-DR{T^-,'niREG) of Section 2 satisfies the 
"boundedness" property. Note, however, 
that i1b-dr{^,^) satisfies |/iB-_D/j(7r,m)| < 
maxi=i^,„^n\Yi -m{Xi)\ + maxj=i,...,„ |m(Xi)|. Thus 
when Y is Bernoulli and m{X) = ^{X'^P) for <I> 
any inverse link with range in [0,1], we have that 
\fj'B-DR{'^,'>TT')\ < 2, so it is within a factor of 2 of 
lying in the parameter space. In contrast, when Y 
is Bernoulli, /i/j/j (vTjm), like fiHT, can be extremely 
large. 

In the next sections, we describe general approaches 
to constructing DR estimators that can be written 
in the form (6) or the form (7). Thus it is impor- 
tant to determine whether DR estimators that sat- 
isfy (6) perform better or worse than those satis- 
fying (7) when models for both m(-) and 7r(-) are 
wrong. Unfortunately, no general recommendation 
can be given because the answer will depend on the 
specific data generating process and models used to 
estimate m(-) and '7r(-). For example, suppose that 
as in Kang and Schaeffer's simulation experiment, 
Y is continuous, var(y|X) = o"^ does not depend 
on X, <1> is the identity link and we estimate the 



model X^I3 for m{X) = E(Y\X) by OLS. When 
(i) (T^/Var(y) is near zero and (ii) there exist a 
number of nonrespondent units j (i.e., units j with 
Tj = 0) whose values Xj of X lie far outside the 
convex hull of the set of values of X for the sub- 
sample of respondents, then yj and m{xj) will be 
close to one another but not to rhuEGixj) except 
if, by luck, the model E{Y\X) = X'^ [3 is so close 
to being correct that the fit of the model to the 
subsample of respondents allows successful linear 
extrapolation to X's far from those fitted. As we 
shall see, it is precisely such "luck" that explains 
the good performance of jloLS in the authors' sim- 
ulation experiment. Without such luck, the estima- 
tor P„{mfi^G(X)} may perform poorly compared to 
Pn{yT/7r(X)}/P„{T/7r(X)}, owing to unsuccessful 
linear extrapolation. On the other hand, when 
0"^/ Var(y) is close to 1 and very few units with T = 
have values of X far outside the convex hull of the 
set of X's in the respondents subsample, 
^niinREGi^)} will generally perform better than 
P„{yr/7r(X)}/P„{T/7r(X)},as the latter may be 
dominated by the large weights \/n{X) assigned 
to respondents who have both very small values of 
7r(X) and large residuals Y — m{X). 

4.1.1 Regression double-robust estimators. We re- 
fer to DR estimators satisfying (7) as regression DR 
estimators. They are obtained by replacing rhjiEG 
in fiDR{'^,'mREG) with m{X) satisfying 

^ {Y-m{X)} 



(8) 



niX) 



0. 



Here we describe three such estimators, though oth- 
ers exist (Robins et al., 2007). 

The first one, proposed in Scharfstein et al. (1999) 
and discussed further in Bang and Robins (2005), is 
the estimator in K&S's Table 8. To compute this es- 
timator one considers an extended outcome model 
of the form ^{X'^P + ip7r{X)~^) adding the covari- 
ate 7r(X)~^ (i-e., the inverse of the fitted propensity 
score). One then jointly estimates {P,^p) with (/3, (^) 
satisfyingP„[r{y-$(X^/3 + c^7r(X)-i)} [*(^)~']] = 

0. The first row of this last equation is precisely (8) 
with ifi^X) replaced with the fitted regression func- 
tion rfiEXT-REGiX) = (^{X'^P + (^7r(X)"^). Conse- 
quently, fLDR{T^,mEXT-REG) =^n{mEXT-REG{X)] . 
This estimator is CAN provided either the model 
7r(x; a) for the propensity score ■k{x) or the model 
$(x"^/3) for E{Y\X = x) is correct. In fact, it is 
CAN even if model ^{x'^ fi) is incorrect provided the 
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model ^{X (3 + ip{ir{X;a*)}~^) is correct, where 
a* is the probabihty hmit of the estimator a of a. 
In particular, as indicated in the previous subsec- 
tion, if Y is Bernoulli and $ is the inverse logit link, 
ilDR{T^,rhEXT-REG) is always in [0,1]. 

Nonetheless, when Y is continuous and $ is the 
identity, \fJiDR{T^-,rhEXT-REG)\ can be disastrously 
large when the estimated inverse probability weights 
tt{X)~^ are highly variable. Specifically, when 
7r(X)~^ is highly variable, it could very well happen 
that in most repeated samples the largest value of 
7r~^ among nonrespondents is manyfold greater than 
the largest value among the respondent subsample. 
[E.g., a typical Monte Carlo replication of Kang and 
Schaeffer under the wrong propensity score model 
had a largest tt{X)~^ of 80 in the respondent sub- 
sample but a largest 7r(X)~^ of 1800 in the non- 
respondent subsample.] In such cases, enormously 
greater extrapolation would be required with model 
^{X^/3 + y57r(X)-i} than with model $(X^/3) to 
obtain fitted values for Y in the nonrespondent sub- 
sample, clearly a problem if the extrapolation model 
${X /3-|-(/?7r(X)~^} is also wrong. This phenomenon 
explains the disastrous performance of 
A-D-r('^) ^ext-reg) observed in K&S's Table 8 when 
both the model for the propensity score and the out- 
come model are wrong. 

A second DR estimator with the regression form 
(7) is immediately obtained by estimating the pa- 
rameter /? of the model E{Y\X) = $(X^/3) with 
the weighted least squares estimator [iwLS that uses 
weights \/Ti{X). By definition, the estimator f3wLS 
satisfies 

^ -{Y-^{X''PwLs)}X 



^X) 







and consequently (8) is immediately true for m{X) 
equal to inwLs{X) = ^{X'^PwLs) when, as we al- 
ways assume in this discussion, the first component 
of X is the constant 1 . It therefore follows that when 
model ^{X'^P) has an intercept, fiDR{'^,'>TT'WLs) = 
Fn{mwLs)- The estimator p.DR{TT,'rh,WLs) is called 
ftWLS in K&S. 

With highly variable 7r(X)~^ and incorrect mod- 
els for both '/r(X) and E{Y\X), we would expect 
llDR{Tr,rhwLs) to outperform fiDRiTr, niEXT-REG) be- 
cause it does not have the severe extrapolation prob- 
lem of flDR{'^,'nT'EXT-REG)- This expectation is dra- 
matically borne out in K&S's simulations. 

Some years ago, Marshall Joffe pointed out to us 
that fiDRi'^iiTT'WLs) was double-robust and asked us 



if it had advantages compared to j1dr{t^-, ttiext-reg)- 
At the time we had not realized that 
fi'DRij^i'rhEXT-REG) would perform so very poorly 
in settings with highly variable tt{X)~'^ , so we told 
him that it probably offered no particular advan- 
tage. Based on our bad advice, Joffe never published 
a paper on fi,DR{T^-,rhwLs) as a DR estimator. To 
our knowledge, Kang and Schaeffer are the first to 
do so. We note that Kang and Schaeffer do not con- 
sider J1dr{t^-,'^wls) to be an AIPW DR estimator. 
However, the above derivation shows otherwise. 

Even fj-DR.iT^T'n^WLs) ™ay not perform well in some 
instances. For example, if Var(y|X) = a^ is con- 
stant, cr^/ Var(y) is near 1 and a number of nonre- 
spondents have X lying far outside the convex hull 
of the respondents' X values, then p'DR{'^,'rhwLs) 
may perform poorly. This is because the subjects 
who have the greatest 7r(X)~^ in the respondents' 
subsample will have enormous leverage which can 
force their residual to be nearly zero, which is a 
problem particularly when o"^/ Var(y) is near 1 and 
the model ^{X'^/3) is misspecified, as then extrap- 
olation to the X's far from the convex hull will be 
poor. 

The third DR regression type estimator is an ex- 
tension of the estimator fiipw-NR in Kang and Scha- 
effer. To compute this estimator we extend the re- 
gression model $(X-^/3) by adding the covariate t^{X) 
(rather than its inverse) to obtain <^{X'^ l3 + ipTT{X)} 
and then jointly estimate (/?, (/?) with the estimator 
(/3, (^) satisfying 



T 



n{X) 



{Y-^X^P + ^^X))} 



7r{X) 
X 



0. 



Because we have assumed the vector X has one 
component equal to the constant 1, (8) is satisfied 
with m{X) equal to m£)ji-jp\Y-^ji{X) = 
^{X'^P + if7r{X)}. Thus, flDR{-^,rhDR-ipw-NR) = 
^niiTT-DR-iPW-NR)- Furthermore, since by construc- 
tion, Pn[r{y - rh-DR-iPW-NRiX)}] = 0, then 
P'DRiTT, rriDR-iPW-NR) is also equal to fn{TY + (1 - 
T)mp)p-jpm-MR{X)}. Because 7r(A') is bounded be- 
tween and 1, adding the covariate t^{X) to model 
<I>(X-^/3), in contrast to adding 7r(X)^^, does not 
induce model extrapolation problems like the ones 
discussed above for fiDR{'^,''T^EXT-REG)- We spec- 
ulate that fiDR{T^,rhDR-iPW-NR) will behave much 
better than j1dr{t^ irhEXT-REG) and possibly simi- 
larly to fiDRi^T^WLs) when ■k{X)~^ has high vari- 
ance. Indeed, we have observed this behavior in the 
simulation study of Section 5; however, due to space 
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limitations, results for ij-DR{'^,'>TT'EXT-REG) were not 
reported as they were qualitatively similar to those 
reported in K&S. 

Finally, the estimator ilDFt.i'^,rhEXT-REG) with $ 
the identity link is also an example of a DR targeted 
maximum likelihood estimator of the marginal mean 
fi in the sense of van der Laan and Rubin (2006). 
We thus conclude from the above discussion that 
with highly variable ■7r(X)~^ and incorrect paramet- 
ric models for tt{X) and E{Y\X), certain targeted 
maximum likelihood estimators can perform much 
worse than the ad hoc estimator fiDRiT^-.'^WLs)- 

4.1.2 Bounded Horvitz-Thompson double-robust 
estimators. We refer to DR estimators satisfying (6) 
as bounded Horvitz-Thompson DR estimators. They 
are obtained by replacing n in j1b-dr{t^, rhREc) with 
TXEXT satisfying 



(9) P„ 



{mREciX) - flREc) 



T 



1 







.t^ext{X) 

where JIreg = IP?^["'-_R_BG(-^)]• We can obtain such a 
t^ext{-) by considering the extended logistic model 
'KExriX) = expit{a"^X + (ph{X)} with h{X) a user- 
supplied function and estimating 93 with 

^PROP-GREED solving 

T 



"[lexpit(a^X + (/j/i(X)) 



1 



{rhREG{X) - jloLs} 







where a is the MLE of a in the model 7r(X) = 
expit(a"'"X). Then t[ext{X) = expitja^X -1- 
^PROP-GREEDh{X)} satisfies (9) and consequently 
(j'B-DRi^EXTirhREG) is of the form (6). A default 
choice for h{X) would be rhuEGiX) — fi-oLS- 

Interestingly, the OLS estimator ftoLS can be 
viewed not only as a DR estimator, as seen in Sec- 
tion 3, but also as a bounded Horvitz-Thompson 
DR estimator! Specifically, suppose that dinvis used 
to estimate a in the propensity model Tr{X;a) as 
in Section 3, except that without imposing the con- 
straints, so that indeed, dinv solves = P„[{;^ 
1}X]. Then 

T 1 

■{mREG{X) - f^OLS) 



1 



T{X;a) 







.7r(X;din\ 
and ft- OLS is equal to the inverse probability weighted 



estimator 



7r(X;Qi 



-Y 



T 



7r(X;di 



Robins (2001) and van der Laan and Rubin (2006) 
describe particular bounded Horvitz-Thompson DR 
estimators fi^°°' that are obtained by iterating to 
convergence a sequence fi^^' of estimators that are 
not themselves bounded Horvitz-Thompson estima- 
tors. However, the Robins (2001) estimator performed 
poorly in our simulations (results not shown) and 
no simulation study of the van der Laan and Rubin 
(2006) estimator has been published to our knowl- 
edge with highly variable inverse probability weights. 
In fact, van der Laan and Rubin (2006) describe 
an estimator fi^'^' that is simultaneously a bounded 
Horvitz-Thompson and a regression DR estimator 
that is obtained by iterating to convergence a se- 
quence fi^^' of estimators without this dual prop- 
erty. Again we do not know of a simulation study 
showing that the estimator generally performs well 
in practice with highly variable inverse probability 
weights. 

5. SIMULATION STUDIES 

To investigate the nature of the surprisingly good 
performance of the regression estimator fioLS in the 
simulation study of K&S and to evaluate the perfor- 
mance of the additional estimators described in Sec- 
tion 4, we replicated the simulation study of K&S. 
Table 1 reports the Monte Carlo bias, variance and 
mean squared error for twelve different estimators 
of /i = 210, sample sizes n = 200 and 1000 and the 
four possible combinations of model specifications 
for the propensity score and the conditional mean 
of the response given the covariates (the latter re- 
ferred to throughout as the outcome model). 

The estimators reported in rows 1 and 9, rows 3 
and 11, rows 4, 12, 17 and 22 and rows 5, 13, 18 and 
23 are, respectively, the estimators jJ-oLS^iJ-iPW-POP, 
JJ-BC-OLS and fiwLS investigated by K&S. Through- 
out we use the notational conventions of Sections 
2-4, and thus we rename JIbc-ols and fiwLS with 
fiDRiT^^rfiREG) and /iDi{(7r,mH'L5), respectively. The 
estimator fJiHT is the Horvitz-Thompson type esti- 
mator Pn{ry/7r(X)}. All remaining estimators are 
DR estimators of ^ and are defined in Sections 2 and 
4. 

When both working models are correct, theory 
indicates that all DR estimators are CAN, asymp- 
totically equivalent, and efficient in the class of esti- 
mators that are CAN even if the outcome model is 
incorrect. We were not surprised then to find that 
all DR estimators reported in rows 4-8 of Table 1 
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Table 1 
Results for simulation study as in K&S 







Sample size 200 






Sample size 1000 




Row Estimator 


Bias 


Var 


MS^E 


Ulas 


Var 


M«E 


Both models right 














1 AoiS 


0.13 


5.97 


5.98 


-0.03 


1.41 


1.41 


2 JIht 


-0.08 


148.92 


148.92 


0.17 


26.46 


26.49 


3 fiipw-pop 


-0.06 


14.12 


14.13 


-0.03 


3.43 


3.43 


4 flDR{n,mREG) 


0.13 


5.96 


5.98 


-0.03 


1.41 


1.41 


5 flDR{-K,mwLs) 


0.13 


5.97 


5.98 


-0.03 


1.41 


1.41 


6 flDR{n,rhDR-IPW-NR) 


0.13 


5.97 


5.98 


-0.03 


1.41 


1.41 


7 jlB-DR{T^,'mREG) 


0.13 


5.96 


5.98 


-0.03 


1.41 


1.41 


8 jiB-DR{T^EXT,rhREG) 


0.13 


5.97 


5.98 


-0.03 


1.41 


1.41 


Botli models wrong 














9 jloLS 


-0.39 


10.91 


11.06 


-0.83 


2.19 


2.88 


10 [iHT 


16.87 


4110.86 


4395.39 


38.97 


39933 


41452 


11 JJ-IPW-POP 


1.67 


73.39 


76.17 


4.81 


108.86 


131.95 


12 ilDR{T^,mREG) 


-4.90 


145.93 


169.91 


-13.91 


6853.68 


7047.12 


13 jlDR{'K,mwLs) 


-2.01 


10.70 


14.74 


-2.98 


2.20 


11.08 


14 p.DR{'rt,ThDR-IPW-NR) 


-1.76 


11.82 


14.90 


-2.49 


1.81 


8.02 


15 flB-DR{TT,mREG) 


-3.82 


40.07 


54.65 


-8.03 


128.61 


193.13 


16 flB-DR{T^EXT,rhREG) 


-2.25 


11.77 


16.82 


-3.33 


3.44 


14.54 


TT-model right, outcome model wrong 














17 flDR{Tr,rhREG) 


0.55 


11.82 


12.12 


0.07 


2.81 


2.82 


18 flDRiTT,mwLs) 


0.65 


8.82 


9.24 


0.16 


1.90 


1.93 


19 p-DRi^jiriDR-IPW-NR) 


0.06 


7.39 


7.40 


-0.10 


1.58 


1.59 


20 fj.B-DR{-^,iriREG) 


0.56 


11.51 


11.83 


0.08 


2.79 


2.80 


21 flB-DR{T^EXT,rhREG) 


0.53 


9.41 


9.69 


0.11 


2.08 


2.09 


TT-model wrong, outcome model right 














22 iJ.DR{n,mREG) 


0.14 


5.95 


5.97 


-0.03 


1.77 


1.77 


23 flDRiTT,mwLs) 


0.13 


5.97 


5.98 


-0.03 


1.41 


1.41 


24 p-DRiT^jrhDR-IPW-Np) 


0.13 


5.97 


5.98 


-0.03 


1.41 


1.41 


25 fj.B-DR{'!i',rhREG) 


0.13 


5.96 


5.97 


-0.02 


1.43 


1.43 


26 flB-DR{T^EXT,mREG) 


0.13 


5.96 


5.98 


-0.02 


1.42 


1.42 



performed identically and were more efficient than 
the inefficient IPW estimators fiipw-POP and fiHT- 
However, the near-identical behavior of the regres- 
sion estimator fioLS caught our attention. The es- 
timator fioLS is the maximum likelihood estimator 
of /i, and hence efficient, in a semiparametric model 
that assumes a parametric form for the conditional 
mean of Y given the covariates. Thus, we would 
have expected it to have smaller variance than that 
of the DR estimators of //, because when both the 
propensity score model and the regression model are 
correct, the latter attains the semiparametric vari- 
ance bound in the less restrictive (nonparametric) 
model that does not impose restrictions on the con- 
ditional mean of Y. A closer examination of the data 
generating process used by K&S explains this un- 
usual behavior. Under S&K data generating process 
Yi = 210 + 13.7Z* + Ei, where Z* = 2Zu + Ei=2^ii> 



with Zji^j = 1, ... ,4, and £i mutually independent 
A^(0, 1) random variables. But under this process 
Z*, and hence Zi = [Zu, Z2i., Z^i, Z/^i) , is an essen- 
tially perfect predictor of Yi: the residual variance 
var(Yi|Z*) is equal to var(yj)/195. This striking fea- 
ture of K&S data generating process is illustrated 
in Figure 1. The figure shows a scatterplot of Y 
versus the predicted values from the fit of the cor- 
rect outcome model to the respondents in a ran- 
dom sample of n = 200 units. Dark dots correspond 
to data points of respondents. White dots corre- 
spond to the simulated missing outcomes Yi of the 
nonrespondents plotted against the predicted val- 
ues Z[f5. The white dots follow nearly perfectly a 
straight line through the origin and with slope 1: 
the predicted values are essentially perfect predic- 
tors of the missing outcomes! When the outcome 
and propensity score models are correctly specified, 
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the asymptotic variance of the DR estimator is equal 
to var(y) + var[7r(Z){l -7r(Z)}-ivar(y|Z)]. When 
Z is a perfect predictor of Y, this variance reduces 
to var(y), the variance of the standardized distribu- 
tion of fipuLL, the sample mean of Y of respondents 
and nonrespondents. This is not surprising because 
it is well known that, when the outcome model is 
correctly specified, a DR estimator asymptotically 
extracts all the information available in Z to pre- 
dict Y. Since the regression estimator ftoLS cannot 
be more efficient than ftpuLL, we conclude that fioLS 
and the DR estimators should have nearly identical 
variance when Z is an almost perfect predictor of 
Y and indeed this variance should be also almost the 
same as that of the infeasible estimator fipuLL- In 
our study we had simulated the outcomes of the non- 
respondents. Thus, we were indeed able to compute 
fj'FULL and its Monte Carlo variance. As expected, 
the Monte Carlo variance of JIfull was essentially 
the same as that of ftoLS for both sample sizes. 

Theory also indicates that the IPW estimators 
fiiPW-POP and iIht of rows 2 and 3 should be CAN. 
However, in our simulations, these estimators were 
nearly unbiased but their sampling distribution was 
skewed to the right and had very large variance. 
Figure 2 shows smooth density estimators for these 
sampling distributions for sample sizes n = 200 and 
n = 1000. The skewness and large variance of the 
IPW estimators were caused by few samples which 
had respondents with large values of Y and very 
large weights l/vr. Specifically, in most samples, the 
true vr values of the respondents were not too small, 
and consequently the weights I/tt not too large, pre- 
cisely because by the very definition of vr, having a 
respondent with a small vr is a rare event. In the 
data generating process of K&S, 7r{Z) is negatively 
correlated with Y; the correlation is roughly equal 
to —0.6. Thus, in most samples, the l/vr-weighted 
mean of the Y values of the respondents tended to 
be smaller than fx. However, in a few samples, some 
anomalous respondent had a small value of vr. In 
the computation of fiHT, this anomalous respondent 
carried an unusually large weight l/vr and because 
his Y value tended to be larger than the mean fx, the 
estimator fiHT in those rare samples tended to be 
substantially larger than //. The skewness lessens as 
the sample size increases because with large sample 
sizes, the number of samples which have respondents 
with small values of vr also increases. The skew- 
ness is also substantially less severe for fiipw-POP 
compared to that of JIht also as expected since. 



as discussed in Section 4.1, in any given sample, 
\f^iPW-POp\ is bounded by the largest observed |y| 
value. 

Although the Monte Carlo sampling distribution 
of the IPW estimators gives a rough idea of the 
shape of the true sampling distribution of these esti- 
mators, neither the Monte Carlo bias nor the Monte 
Carlo variance should be trusted. One thousand repli- 
cations are not enough to capture the tail behavior 
of highly skewed sampling distributions, and as such 
cannot produce reliable Monte Carlo estimates of 
bias, much less of variance. 

Turn now to the case in which the propensity score 
model is correct but the outcome model is incor- 
rect. Theory indicates that the DR estimators of 
rows 17 to 21 of Table 1 should be CAN. However, 
in our simulations nearly all the DR estimators were 
slightly biased upward. Nevertheless, all DR estima- 
tors performed as well as or better, in terms of MSE, 
than the OLS estimator of row 9. 

Consider now the case in which the propensity 
score model is incorrect but the outcome model is 
correct. Once again, the almost identical performance 
of all DR estimators in rows 22-26 of Table 1 with 
that of the OLS estimator of row 1 is no surprise 
after recalling that Z is a perfect predictor of Y. 
Specifically, the fact that Z* is a nearly perfect pre- 
dictor of Yi implies that rh{Zi) is almost identi- 
cal to the outcome Yi regardless of whether unit i 
is a respondent or a nonrespondent and regardless 
of whether rh{Zi) was fit by ordinary least squares 
or by weighted least squares. Thus, the average of 
rh{Zi) is essentially the same as jj-full and the sum 
of TjTTj" (Yi — m{Zi)) is almost zero regardless of the 
model under which vr was computed. Consequently, 
all DR estimators must be nearly the same as the 
infeasible full data sample mean fipuLL- 

Finally, turn to the case in which both propen- 
sity score and the outcome models are wrong. The 
performance of the IPW estimators is disastrous as 
well as that of the DR estimator in row 12 and, to a 
lesser extent, that of the estimator in row 15. Figure 
3 shows smooth density estimators of the sampling 
distribution of these four estimators when the sam- 
ple size is 1000. The estimators fiuT and fi-ipw-POP 
have distributions heavily skewed to the right while 
the estimators fj.DR{TT,mREG) and jiB-DRiT^^rhREG) 
have distributions heavily skewed to the left. The 
skewness is far more dramatic for the estimators 
fiHT and iJ-DR{'^,'>TT'REG) than for their counterparts 
fi-lPW-POP and fLB-DRiT^,mREG), reflecting the fact 
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Fig. 1. K&S simulation experiment. Outcomes vs predicted values. Sample size 200. Top: correct y model. Bottom: wrong y 
model. Dashed line is the line Y = X. Dark dots: respondents. White dots: nonrespondents. 
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Sample size 200 



Sample size 1000 
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Fig. 2. Distributions of JIht and JIipw-pop under correct propensity score models 



that fJ-ipw-POP and /i_B-D_R(vr,m/j^G) are bounded 
in the sense described in Section 4.1 while fiHT and 
jJ'DRiT^-.'rnREG) are unbounded. [Indeed, to avoid dis- 
tortions, in constructing the density plots of fiHT 
and fiDR{TT,^REG) we have omitted the extreme 
values of 5873 and —2213, respectively, from one 
simulation replication.] Rows 12 and 14 of Table 
1 report that the Monte Carlo bias and variance 
indeed are even larger for n = 1000 than for n = 
200. The extreme distribution skewness and the in- 



crease in bias and variance with sample size are ex- 
plained as follows. As noted earlier, even when the 
vr's are estimated from a correct model, the distri- 
bution of fiHT and (xipw-POP will tend to be skewed 
to the right when l/vr is positively correlated with 
Y because of the presence of a few unusual sam- 
ples with anomalous respondents with large Y val- 
ues and small vr values. Now, because of the nature 
of the wrong analytic propensity score model used 
in the simulations, the estimated vr's corresponding 
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Fig. 3. Distributions of /Iht, p.iPW-POP,P-DR{^,inij.EG) and p,B-DR{Ti',rhREG) under incorrect propensity score and outcome 
models. 



to the anomalous units in the unusual samples were 
many times smaller than the true vr's. As a conse- 
quence the, usually large, values of Y of the anoma- 
lous units essentially determined the values of fiHT 
and (j-ipw-POP in the unusual samples and conse- 
quently, exacerbated even more the skewness of the 
Monte Carlo sampling distribution of the IPW esti- 
mators. The larger bias and variance when n = 1000 
than when n = 200 were due to two replications 
with sample size 1000 in which the values of the 



estimators were extreme [specifically, /Iht = 1475 
and 2884, and fi'DRi'^,'rnREG) = —2213 and —175]. 
These outlying values were caused by one anoma- 
lous nonrespondent in each sample with large val- 
ues of Y (the second largest Y values in one sample 
and the largest in the other). For these units, the 
l/vr values were 38.7 and 50.9 but l/vr were 17,068 
and 399, respectively. When the two samples with 
these anomalous units were removed, the variance 
of the estimators fiHT and fj-DR{^,TTT'REG) decreased 
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to 6729 and 890, respectively. The paradoxical in- 
crease in Monte Carlo variance with sample size is 
but another proof that the Monte Carlo variance 
in simulations with 1000 replications is not a reli- 
able estimator of the true variance for estimators 
with highly skewed distributions. The different di- 
rectionality of the skewness of the IPW and DR es- 
timators is explained as follows. In the computation 
of fiDR{-^,mREG) and jiB-DBiT^MREG) we inverse 
probability weight the values of {itireg — Y). Con- 
sequently since, as indicated below, under K&S's 
wrong analytic outcome model, rhuEG was reason- 
ably bounded; thus, in the few unusual samples, the 
anomalous units with small vr's had large and nega- 
tive values of {rhjiEG — ^) and produced extremely 
small values of the DR estimators. 

The performance of the remaining DR estimators 
in rows 13, 14 and 16 is heterogeneous. Some, though 
still biased, have bias and variance orders of mag- 
nitude smaller than the variance of the estimators 

fJ'DR{T^,rhjiEG) and fJ.B-DR{T^,'mREG)- 

In a second simulation experiment described be- 
low, the relative performance of the DR estimators 
was somewhat different than in this simulation study 
and, as we explain later, better than that of the re- 
gression estimator fioLS- This attests to the obvious 
fact that when the propensity score and outcome 
models are both incorrect we cannot expect to find 
a single clear winner. The relative performance of 
the estimators will very much depend on the data 
generating process and the nature of the model mis- 
specifications. 

To understand why the regression estimator ftoLS 
performed so remarkably well when both models 
were wrong, we first note that because the outcome 
model was a linear regression model with an inter- 
cept fitted by ordinary least squares in the respon- 
dent subsample, the sum of the predicted values X' (3 
and the sum of Y in the respondent subsample are 
the sa me. Thus, fipLs = {nohs/n)Yohs + (nmiss/n) 
(X'/3)j^igg, where {X'P)^^^^ is the average of the pre- 
dicted values for the missing outcomes. The bias of 
ftOLS therefore depends on the bias of {X'(3)^^^^ as 
an estimator of the mean of Y in the nonrespon- 
dent subpopulation. If, due either to good luck or 
"cherry picking," the prediction function x'f3 from 
a misspecified regression model x'P successfully ex- 
trapolates to the covariates of nonrespondents, even 
when these are far from the conv ex hul l of covari- 
ates in the respondent subsample, iX'P)^-^^^ — Y^iss 



will be roughly centered around 0, and consequently 
fj-OLS will be a nearly unbiased estimator of the 
mean of Y. We now show this phenomenon explains 
the excellent performance of jj-qls in Kang & Scha- 
effer's simulation. In Figure 4 we plotted the out- 
comes Y versus the predicted values X'P in the pre- 
viously mentioned unusual sample of size 1000 with 
both the propensity and outcome models misspeci- 
fied, where fiDR{T^, rhREG),fi-B-DR{T^, rriREG) and the 
IPW estimators did disastrously due to the pres- 
ence of one anomalous unit with extremely small 
71". The dark dots correspond to the observed data 
values of the respondents. White dots correspond 
to the actual simulated missing outcomes Y of the 
nonrespondents plotted against the predicted val- 
ues X 13. We can see that the predicted values of 
the nonrespondents are reasonably centered around 
the straight line even for those points with predicted 
values far from the predicted values of the respon- 
dents. In this sample, jloLS was 205.78, a far more 
reasonable value than those obtained for the IPW 
and just-mentioned DR estimators. 

To demonstrate that fiQis can have a substan- 
tially worse performance than the DR estimators, 
we conducted a second simulation experiment. This 
second experiment, like our first, redid K&S's simu- 
lation by generating the data {Y,T^X) from K&S's 
chosen distributions. However, in our second experi- 
ment we analyzed the data ((1 — T)Y, T, X) in which 
Y is observed only when T = 0, rather than the data 
{TY, T, X) that was analyzed by us in our first ex- 
periment and by K&S in their paper. [To do so, 
since the data ((1 — T)Y,T,X) can be recoded as 
((1 — T)Y,1 — T,X), we simply recompute each of 
the estimators reported in Table 1 except now we 
everywhere replace n and T by 1 — tt and 1 — T.] 
The results are displayed in Table 2. We observe 
that, with both models wrong, the bias and MSE of 
P-OLS now exceed those of any DR estimator! 

As in our first experiment, due to the extreme 
variability in the estimated "inverse probability" 
weights, the DR estimators appear to have consider- 
able finite sample bias, especially at the smaller sam- 
ple size of 200, when the propensity model is correct 
but the outcome model is wrong. In fact, this bias is 
larger than it was in the first simulation experiment, 
which was to be expected as the variability in the 
estimated "inverse probability" weights was greater 
in the second than the first experiment (data not 
shown) . 
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Fig. 4. Y vs predicted values in one sample of size 1000 generated under K&S experiment. Dashed line is the line Y= X. 
Dark dots: respondents. White dots: nonrespondents. 



6. SENSITIVITY ANALYSIS 

Consider again the missing-data setting with the 
mean /i oi Y as the parameter of interest. When 
the covariate vector X is high dimensional, one can- 
not be certain, owing to lack of power, that a chosen 
model for the propensity score is nearly correct, even 
if it passes standard goodness-of-fit tests. Therefore 
a large number of models for the propensity score 
with different subsets of the covariates, different or- 
ders of interactions and different dimensions of the 
parameter vector should be fit to the data. Similarly, 
many different outcome models should be fit. This 
raises the question: once fit, how should these many 
candidate models be used in the estimation of the 
mean of y? 

One approach is to use modern techniques of model 
selection to choose a single propensity and outcome 
model. Specifically, there has been a recent outpour- 
ing of work on model selection in regression. This 
work has shown that one can use cross-validation 
and/or penalization to empirically choose, from 
among a large number of candidates, a model whose 
predictive risk for the response variable in the re- 
gression is close to that of the best candidate model. 
In fact, van der Laan (2005) has proposed that k- 
fold cross-validation should be routinely employed 



to select the model for the propensity score and for 
the outcome regression that are to be used in the 
construction of a DR estimator. 

An alternative approach which we are currently 
studying is the following. Suppose one has fit Jp 
propensity score models and Jq outcome models. 
For a favorite DR estimator ft, define fuj as the 
DR estimator that uses the fitted values from the 
ith propensity model and the jth outcome model. 
Now, if the ith. propensity model is correct, all Jo 
estimators in the set Sp^i = {flij'jj = 1, . . . , Jo} will be 
CAN estimators of fi. Thus, an a-level test of the ho- 
mogeneity hypothesis Hpi : E {fin) = E {fiij) for all 
j G {2, . . . , Jo} [where E {■) stands for large sample 
mean, i.e., the probability limit of •] is also an a-level 
goodness-of-fit test for the propensity model that is 
directly relevant to its use in a DR estimator of jjl. 
Similarly if the jth outcome model is correct, all Jp 

1, . . . , Jp} will be 



jji 



\t- 



estimators in the set 8o^j = {/*: 
CAN for II and a test of the homogeneity hypothe- 
sis Hoj : E^{jlij) = E'^{fiij) for alH S {2, . . . , Jp} is a 
test of fit for the outcome model. This suggests that 
one could choose as a final estimator of jjl the DR 
estimator fii*j*, where i* is the i for which the test 
of the hypothesis Hpi gave the largest p- value and j* 
is the i for which the test of the hypothesis Hoj gave 
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Table 2 
Results for simulation study as in K&S but with the roles of T and 1 — T reversed 



Row 



Estimator 





Sample size 200 




Bias 


Var 


Mi^E 


0.12 


5.96 


5.98 


-0.46 


49.14 


49.36 


0.45 


14.76 


14.96 


0.12 


5.96 


5.98 


0.12 


5.96 


5.97 


0.12 


5.96 


5.97 


0.12 


5.96 


5.97 


0.12 


5.96 


5.97 


4.97 


7.97 


32.68 


0.55 


40.27 


40.57 


3.92 


9.67 


25.03 


3.33 


8.79 


19.90 


3.17 


8.21 


18.24 


3.11 


8.21 


17.90 


3.32 


8.70 


19.69 


3.30 


8.68 


19.55 


0.71 


12.60 


13.11 


0.99 


8.04 


9.02 


0.71 


7.26 


7.76 


0.75 


11.21 


11.76 


0.86 


10.38 


11.12 


0.12 


5.96 


5.97 


0.12 


5.96 


5.97 


0.12 


5.96 


5.97 


0.12 


5.96 


5.97 


0.12 


5.96 


5.97 





Sample size 1000 




Bias 


Var 


M«E 


-0.03 


1.41 


1.41 


-0.24 


8.45 


8.51 


0.05 


3.11 


3.12 


-0.02 


1.41 


1.41 


-0.02 


1.41 


1.41 


-0.02 


1.40 


1.41 


-0.02 


1.41 


1.41 


-0.02 


1.40 


1.41 


4.97 


1.91 


26.62 


0.39 


6.27 


6.43 


3.68 


2.22 


15.79 


3.07 


2.12 


11.53 


2.81 


1.97 


9.84 


2.64 


1.97 


8.94 


3.04 


2.10 


11.34 


3.01 


2.10 


11.16 


0.14 


2.96 


2.98 


0.23 


1.92 


1.97 


0.18 


1.72 


1.75 


0.14 


2.76 


2.78 


0.18 


2.71 


2.74 


-0.02 


1.40 


1.41 


-0.02 


1.40 


1.41 


-0.02 


1.40 


1.41 


-0.02 


1.40 


1.41 


-0.02 


1.40 


1.41 



Both models right 

1 fl-OLS 

2 flHT 

3 fiipw-pop 

4 ij.DR{T^,mREG) 

5 fl.DRifc,iriWLs) 

6 flDRiT^jlTlDR-IPW-NR) 

7 jlB-DR{TT,lflREG) 

8 fl,B-DR{T^EXT,mREG) 

Both models wrong 

9 P^OLS 

10 flHT 

11 fllPW-POP 

12 fiDR{Ti',rhREG) 

13 fiDRiTTjrhwLs) 

14 flDR{ji,rhDR-IPW-NR) 

15 fl,B-DR{TT,mREG) 

16 fl,B-DR{TTEXT,rhREG) 

TT-model right, outcome model wrong 

17 ilDR{'ri-,rhREG) 

18 p.DR{fc,rhwLs) 

19 flDRij^jiriDR-IPW-NR) 

20 fJ.B-DR{TT,rhREG) 

21 ij.B-DR.{j^EXT,rhREG) 

TT-model wrong, outcome model right 

22 flDR{ji,'thREG) 

23 fiDRiTTjrhwLs) 

24 p.DR{'^,rhDR-IPW-NR) 

25 flB-DRiTTjrhREG) 

26 flB-DRiTTEXTjiriREG) 



the largest p-value. However, this method of select- 
ing i* and j* is nonoptimal for two reasons. First, it 
could easily select a misspecified propensity model i 
for which the power of the test of the hypothesis Hpi 
is particularly poor and similarly for the outcome re- 
gression. This remark implies that some measure of 
the spread of the elements of Sp^i and £oj should 
also contribute to the selection of i* and j* . Second, 
the method does not exploit the fact that if i* and j* 
are correct, then E^{jlij*) = E'^{jli*j) for all i and j, 
suggesting that an optimal method should select i* 
and j* jointly. Alternative approaches for selecting 
i* and j* will be reported elsewhere. In any case, 
the very fact that input to the selection algorithm 
requires the matrix jlij provides an informal sensi- 
tivity analysis; we directly observe the sensitivity of 
our DR estimator to the choice of propensity and 
outcome regression model. 



The approach just described could also be com- 
bined with the model selection approach. Specifi- 
cally, one first uses cross-validation to choose not 
one but rather Jp and Jo propensity and outcome 
models (the ones with the Jp and Jo lowest cross- 
validated risk estimates) out of a much larger num- 
ber of candidate models and next, one uses these 
Jp +Jo models as input for the approach described 
above. Sensitivity to the choice of the particular DR 
estimator might be included by using a number of 
different DR estimators and selecting among or aver- 
aging over DR estimators that give similar estimates 

van der Laan (2005) has proposed some new ap- 
proaches to model selection for DR estimation that 
go beyond his above-mentioned approach, which we 
do not discuss here due to space limitations. Finally, 
Wang, Petersen, Bangsberg and van der Laan (2006) 
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have proposed using the parametric bootstrap to 
study the sensitivity of DR estimates to highly vari- 
able "inverse probability" weights. 

7. FURTHER CONSIDERATIONS 

Estimation of Causal Effects 

K&S briefly touch on the problem of estimating 
the difference of the outcome means corresponding 
to two treatments in an observational study under 
ignorability. This difference is often referred to as 
the average causal effect (ACE). K&S view the prob- 
lem of estimating ACE essentially as two missing- 
data problems, each one regarding the outcomes of 
subjects that do not follow the treatment of con- 
cern as missing. The difference of the DR estima- 
tors of the separate means serves as an estimator of 
the mean difference ACE. However, the difference 
of the two DR estimators will have poor small sam- 
ple behavior if there is incomplete overlap of the 
estimated propensity scores in the treated and un- 
treated. In fact, in the presence of incomplete over- 
lap, most investigators argue against trying to es- 
timate ACE and in favor of estimating the causal 
effect in the subpopulation of subjects with overlap- 
ping propensity scores. However, assuming the ACE 
parameter is of some substantive interest, Robins et 
al. (2007) suggest an alternative to reporting the dif- 
ference of two DR estimators of the separate means. 
Their approach is based on fitting a linear semipara- 
metric regression model for the unknown conditional 
effect function encoding the dependence of the con- 
ditional treatment effect on the baseline covariates 
X. Their model has the property that it is guaran- 
teed to be correctly specified under the null hypoth- 
esis that the conditional effect function is the zero 
function. Robins et al. (2007) show that this strat- 
egy results in estimators of the ACE that greatly 
outperform any estimator based on the difference of 
double-robust estimators, whenever the model for 
the conditional effect function is correctly specified; 
in particular, when the aforementioned null hypoth- 
esis is true. 

Multiple Robustness 

Consider again the MAR missing-data model with 
X very high dimensional (say 20-100 continuous co- 
variates) so we cannot possibly hope to model the 
propensity score or the outcome regression nonpara- 
metrically. Double-robust estimators of the mean // 



of Y are n^' ^-consistent if either one of two paramet- 
ric models is correct but inconsistent if both mod- 
els are misspecified. This property of DR estimators 
seems unsatisfactory, as it means that one does very, 
very well if one of the two models is correct but can 
do very, very poorly when both are incorrect. Might 
we do better? 

Define an estimator to be m-rohust for /z at rate 
rf^ if the estimator is n'^-consistent for /i when any 
one of m parametric models is correct, but inconsis- 
tent if all m models are misspecified. A DR estima- 
tor is then a 2-robust estimator with a = 1/2. Our 
view is that an rri-robust estimator with m large, 
even though this may require a to be much smaller 
than 1/2 and so entail a much slower rate of con- 
vergence, would usually be preferable to a DR es- 
timator for the following two reasons. First, if one 
uses an ?TT,-robust rather than a DR estimator, one 
is more likely to be using a consistent estimator of 
/i (as it is always more likely that at least one of m, 
rather than one of two, models is correct). Second, 
the slower rate of convergence (under the assump- 
tion one of the m models is correct) will result in 
wider nominal confidence intervals than the usual 
nominal intervals of length l/v}''^ associated with 
a DR estimator. Such a wide interval seems to us a 
more appropriate measure of the actual uncertainty 
about /u, more accurately refiecting the fact that our 
estimator could even be inconsistent if all m models 
are incorrect. 

These observations raise the following questions. 
Do m-rohust estimators exist for arbitrarily large m 
if we are willing to sacrifice v}'"^- consistency for n"- 
consistency with a perhaps much smaller than 1/2? 
What is the maximum value of a we can achieve for 
a given ml If m-rohust estimators exist for m> 2, 
how do we construct them? Answers to these ques- 
tions can be found in Robins, Li, Tchetgen and van 
der Vaart (2007), where it is shown that, under weak 
assumptions, (i) m-rohust estimators exist for all m, 
(ii) TTi-robust estimators are (m— 1) dimensional U- 
statistics, for which explicit closed-form expressions 
are given, and (iii) the maximal possible a is of- 
ten less than 1/2 and sometimes much less. How- 
ever, the finite sample properties of m-rohust esti- 
mators have yet to be studied even by simulation. 
Thus we will have to wait to see if they are as use- 
ful in practice as theory would indicate they should 
be. 
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